Computer Vision and Pattern Recognition 52
♻ ☆ GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection
As DeepFake video manipulation techniques escalate and pose profound threats,
developing efficient detection strategies has become an urgent need. One
particular issue is that facial images are often mis-detected, typically owing
to degraded videos or adversarial attacks, which leads to unexpected temporal
artifacts that can undermine the efficacy of DeepFake video detection
techniques. This paper introduces a novel method for robust DeepFake video
detection, harnessing the proposed Graph-Regularized Attentive Convolutional
Entanglement (GRACE), built on a graph convolutional network with a graph
Laplacian, to address these challenges. First, conventional Convolutional
Neural Networks are deployed to extract spatiotemporal features for the entire
video. Then, the spatial and temporal features are mutually entangled by
constructing a graph with a sparsity constraint, ensuring that the essential
features of valid face images are retained in the noisy face sequences, thereby
improving the stability and performance of DeepFake video detection.
Furthermore, a graph Laplacian prior is introduced into the graph convolutional
network to remove noise patterns in the feature space and further improve
performance. Comprehensive experiments illustrate that our
proposed method delivers state-of-the-art performance in DeepFake video
detection under noisy face sequences. The source code is available at
https://github.com/ming053l/GRACE.
comment: Submitted to TPAMI 2024
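The core operation the abstract describes, smoothing per-frame features with a sparse graph and a graph Laplacian prior, can be illustrated with a minimal NumPy sketch. This is a conceptual illustration only, not the authors' GRACE architecture; the kNN sparsification, cosine affinity, and the smoothing strength `lam` are illustrative assumptions.

```python
import numpy as np

def laplacian_smooth(features, k=5, lam=0.5):
    """Smooth per-frame feature vectors with a normalized graph Laplacian.

    features: (T, D) array of CNN features, one row per frame.
    k:        number of nearest neighbours kept per node (sparsity constraint).
    lam:      smoothing strength.
    """
    T = features.shape[0]
    # Cosine affinity between frames.
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    affinity = norm @ norm.T
    np.fill_diagonal(affinity, 0.0)

    # Keep only the k strongest edges per node (sparse graph).
    adj = np.zeros_like(affinity)
    for i in range(T):
        idx = np.argsort(affinity[i])[-k:]
        adj[i, idx] = affinity[i, idx]
    adj = np.maximum(adj, adj.T)  # symmetrize

    # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-8))
    lap = np.eye(T) - d_inv_sqrt @ adj @ d_inv_sqrt

    # Smoothing step: solve (I + lam * L) X_smooth = X.
    return np.linalg.solve(np.eye(T) + lam * lap, features)

# Example: 32 frames, 256-dim features with noise.
smoothed = laplacian_smooth(np.random.randn(32, 256))
```

Frames whose features deviate strongly from their graph neighbours (e.g., mis-detected faces) get pulled toward the consensus of the sequence, which is the stabilizing effect the abstract attributes to the Laplacian prior.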
♻ ☆ EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation CVPR 2024
Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
In this report, we present our solutions to the EgoVis Challenges in CVPR
2024, including five tracks in the Ego4D challenge and three tracks in the
EPIC-Kitchens challenge. Building upon the video-language two-tower model and
leveraging our meticulously organized egocentric video data, we introduce a
novel foundation model called EgoVideo. This model is specifically designed to
cater to the unique characteristics of egocentric videos and provides strong
support for our competition submissions. In the Ego4D challenges, we tackle
various tasks including Natural Language Queries, Step Grounding, Moment
Queries, Short-term Object Interaction Anticipation, and Long-term Action
Anticipation. In addition, we also participate in the EPIC-Kitchens challenge,
where we engage in the Action Recognition, Multiple Instance Retrieval, and
Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these
diverse tasks, we showcase its versatility and effectiveness in different
egocentric video analysis scenarios, demonstrating the powerful representation
ability of EgoVideo as an egocentric foundation model. Our codebase and
pretrained models are publicly available at
https://github.com/OpenGVLab/EgoVideo.
comment: Champion solutions in the EgoVis CVPR 2024 workshop
♻ ☆ A Survey on Deep Clustering: From the Prior Perspective
Facilitated by the powerful feature extraction ability of neural networks,
deep clustering has achieved great success in analyzing high-dimensional and
complex real-world data. The performance of deep clustering methods is affected
by various factors such as network structures and learning objectives. However,
as pointed out in this survey, the essence of deep clustering lies in the
incorporation and utilization of prior knowledge, which is largely ignored by
existing works. From pioneering deep clustering methods based on data structure
assumptions to recent contrastive clustering methods based on data augmentation
invariances, the development of deep clustering intrinsically corresponds to
the evolution of prior knowledge. In this survey, we provide a comprehensive
review of deep clustering methods by categorizing them into six types of prior
knowledge. We find that, in general, the innovation of priors follows two
trends, namely, i) from mining to constructing, and ii) from internal to
external. In addition, we provide a benchmark on five widely used datasets and
analyze the performance of methods with diverse priors. By offering this
prior-knowledge perspective, we hope the survey provides novel insights and
inspires future research in the deep clustering community.
♻ ☆ Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID CVPR 2024
Text-to-image person re-identification (ReID) retrieves pedestrian images
according to textual descriptions. Manually annotating textual descriptions is
time-consuming, restricting the scale of existing datasets and therefore the
generalization ability of ReID models. As a result, we study the transferable
text-to-image ReID problem, where we train a model on our proposed large-scale
database and directly deploy it to various datasets for evaluation. We obtain
substantial training data via Multi-modal Large Language Models (MLLMs).
Moreover, we identify and address two key challenges in utilizing the obtained
textual descriptions. First, an MLLM tends to generate descriptions with
similar structures, causing the model to overfit specific sentence patterns.
Thus, we propose a novel method that uses MLLMs to caption images according to
various templates. These templates are obtained using a multi-turn dialogue
with a Large Language Model (LLM). Therefore, we can build a large-scale
dataset with diverse textual descriptions. Second, an MLLM may produce
incorrect descriptions. Hence, we introduce a novel method that automatically
identifies words in a description that do not correspond with the image. This
method is based on the similarity between a text token and all patch token
embeddings in the image. Then, we mask these words with a larger probability in
the subsequent training epoch, alleviating the impact of noisy textual
descriptions. The experimental results demonstrate that our methods
significantly boost the direct transfer text-to-image ReID performance.
Benefiting from the pre-trained model weights, we also achieve state-of-the-art
performance in the traditional evaluation settings.
comment: CVPR 2024
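The noisy-word filtering step described above, assigning a higher masking probability to caption words that match no image patch well, can be sketched as follows. This is a hedged illustration of the idea, not the paper's implementation; the function name, the max-over-patches matching, and the linear rescaling of probabilities are assumptions.

```python
import numpy as np

def noisy_word_mask_probs(word_embs, patch_embs, base_p=0.15, max_p=0.9):
    """Assign a masking probability to each caption word.

    Words whose best-matching image patch is dissimilar are more likely to be
    MLLM hallucinations, so they receive a higher masking probability.
    word_embs:  (W, D) token embeddings of the caption.
    patch_embs: (P, D) patch token embeddings of the image.
    """
    w = word_embs / (np.linalg.norm(word_embs, axis=1, keepdims=True) + 1e-8)
    p = patch_embs / (np.linalg.norm(patch_embs, axis=1, keepdims=True) + 1e-8)
    sim = w @ p.T                      # (W, P) cosine similarities
    best = sim.max(axis=1)             # best patch match per word
    # Map low similarity -> high masking probability (linear rescaling).
    rank = (best - best.min()) / (best.max() - best.min() + 1e-8)
    return base_p + (1.0 - rank) * (max_p - base_p)

# Example: 12 caption tokens, 196 image patches, 512-dim embeddings.
probs = noisy_word_mask_probs(np.random.randn(12, 512), np.random.randn(196, 512))
mask = np.random.rand(len(probs)) < probs  # words to mask in the next epoch
```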
♻ ☆ VIPriors 4: Visual Inductive Priors for Data-Efficient Deep Learning Challenges
Robert-Jan Bruintjes, Attila Lengyel, Marcos Baptista Rios, Osman Semih Kayhan, Davide Zambrano, Nergis Tomen, Jan van Gemert
The fourth edition of the "VIPriors: Visual Inductive Priors for
Data-Efficient Deep Learning" workshop features two data-impaired challenges.
These challenges address the problem of training deep learning models for
computer vision tasks with limited data. Participants are limited to training
models from scratch using a low number of training samples and are not allowed
to use any form of transfer learning. We aim to stimulate the development of
novel approaches that incorporate inductive biases to improve the data
efficiency of deep learning models. Significant advancements are made compared
to the provided baselines, where winning solutions surpass the baselines by a
considerable margin in both tasks. As in previous editions, these achievements
are primarily attributed to heavy use of data augmentation policies and large
model ensembles, though novel prior-based methods seem to contribute more to
successful solutions compared to last year. This report highlights the key
aspects of the challenges and their outcomes.
♻ ☆ Framing image registration as a landmark detection problem for label-noise-aware task representation (HitR)
Diana Waldmannstetter, Ivan Ezhov, Benedikt Wiestler, Francesco Campi, Ivan Kukuljan, Stefan Ehrlich, Shankeeth Vinayahalingam, Bhakti Baheti, Satrajit Chakrabarty, Ujjwal Baid, Spyridon Bakas, Julian Schwarting, Marie Metz, Jan S. Kirschke, Daniel Rueckert, Rolf A. Heckemann, Marie Piraud, Bjoern H. Menze, Florian Kofler
Accurate image registration is pivotal in biomedical image analysis, where
selecting suitable registration algorithms demands careful consideration. While
numerous algorithms are available, the evaluation metrics to assess their
performance have remained relatively static. This study addresses this
challenge by introducing a novel evaluation metric termed Landmark Hit Rate
(HitR), which focuses on the clinical relevance of image registration accuracy.
Unlike traditional metrics such as Target Registration Error, which emphasize
subresolution differences, HitR considers whether registration algorithms
successfully position landmarks within defined confidence zones. This paradigm
shift acknowledges the inherent annotation noise in medical images, allowing
for more meaningful assessments. To equip HitR with label-noise-awareness, we
propose defining these confidence zones based on an Inter-rater Variance
analysis. Consequently, hit rate curves are computed for varying landmark zone
sizes, enabling performance measurement for a task-specific level of accuracy.
Our approach offers a more realistic and meaningful assessment of image
registration algorithms, reflecting their suitability for clinical and
biomedical applications.
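A minimal sketch of how such a hit-rate curve could be computed is given below. The zone radii would come from the paper's inter-rater variance analysis; here they are passed in as parameters, and the random example data is purely illustrative.

```python
import numpy as np

def landmark_hit_rate(pred, gt, zone_radius):
    """Fraction of landmarks placed within a confidence zone of the ground truth.

    pred, gt:    (N, 3) arrays of landmark coordinates (e.g., in mm).
    zone_radius: scalar or (N,) per-landmark radius, e.g., derived from
                 inter-rater variance.
    """
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= zone_radius))

def hit_rate_curve(pred, gt, radii):
    """Hit rate over a range of zone sizes, yielding a task-specific curve."""
    return [landmark_hit_rate(pred, gt, r) for r in radii]

# Example: 50 landmarks, zone sizes from 1 to 10 mm.
gt = np.random.randn(50, 3) * 2.0
pred = gt + np.random.randn(50, 3) * 1.5
curve = hit_rate_curve(pred, gt, radii=np.linspace(1.0, 10.0, 10))
```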
♻ ☆ Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP
Image-text contrastive models like CLIP have wide applications in zero-shot
classification, image-text retrieval, and transfer learning. However, they
often struggle on compositional visio-linguistic tasks (e.g., attribute-binding
or object-relationships) where their performance is no better than random
chance. To address this, we introduce SDS-CLIP, a lightweight and
sample-efficient distillation method to enhance CLIP's compositional
visio-linguistic reasoning. Our approach fine-tunes CLIP using a distillation
objective borrowed from large text-to-image generative models like
Stable-Diffusion, which are known for their strong visio-linguistic reasoning
abilities. On the challenging Winoground benchmark, SDS-CLIP improves the
visio-linguistic performance of various CLIP models by up to 7%, while on the
ARO dataset, it boosts performance by up to 3%. This work underscores the
potential of well-designed distillation objectives from generative models to
enhance contrastive image-text models with improved visio-linguistic reasoning
capabilities.
comment: Short paper
♻ ☆ Fine-tuning can cripple your foundation model; preserving features may be the solution
Pre-trained foundation models, due to their enormous capacity and exposure to
vast amounts of data during pre-training, are known to have learned plenty of
real-world concepts. An important step in making these pre-trained models
effective on downstream tasks is to fine-tune them on related datasets. While
various fine-tuning methods have been devised and have been shown to be highly
effective, we observe that a fine-tuned model's ability to recognize concepts
on tasks $\textit{different}$ from the downstream one is reduced significantly
compared to its pre-trained counterpart. This is an undesirable effect of
fine-tuning as a substantial amount of resources was used to learn these
pre-trained concepts in the first place. We call this phenomenon ''concept
forgetting'' and via experiments show that most end-to-end fine-tuning
approaches suffer heavily from this side effect. To this end, we propose a
simple fix to this problem by designing a new fine-tuning method called
$\textit{LDIFS}$ (short for $\ell_2$ distance in feature space) that, while
learning new concepts related to the downstream task, allows a model to
preserve its pre-trained knowledge as well. Through extensive experiments on 10
fine-tuning tasks we show that $\textit{LDIFS}$ significantly reduces concept
forgetting. Additionally, we show that LDIFS is also highly effective for
continual fine-tuning on a sequence of tasks, in comparison with both
fine-tuning and continual learning baselines.
comment: Published in TMLR: https://openreview.net/forum?id=kfhoeZCeW7
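The regularizer the name LDIFS suggests, an L2 penalty in feature space against the frozen pre-trained backbone, can be sketched in PyTorch as below. This is a hedged sketch of the general recipe, not the authors' code; `model.features()` and `model.head()` are assumed helper methods, and the weighting `lam` is illustrative.

```python
import torch
import torch.nn.functional as F

def feature_preserving_loss(model, frozen_pretrained, images, labels, lam=1.0):
    """Fine-tuning loss with an L2 feature-space penalty (LDIFS-style idea).

    model:             the model being fine-tuned; features()/head() are assumed
                       helpers returning backbone features and classifier logits.
    frozen_pretrained: a frozen copy of the original pre-trained model.
    """
    feats = model.features(images)
    logits = model.head(feats)
    task_loss = F.cross_entropy(logits, labels)

    with torch.no_grad():
        ref_feats = frozen_pretrained.features(images)

    # Penalize drift away from the pre-trained feature space to limit
    # concept forgetting while still learning the downstream task.
    feat_loss = F.mse_loss(feats, ref_feats)
    return task_loss + lam * feat_loss
```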
♻ ☆ Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging
Emma A. M. Stanley, Raissa Souza, Anthony Winder, Vedant Gulve, Kimberly Amador, Matthias Wilms, Nils D. Forkert
Artificial intelligence (AI) models trained using medical images for clinical
tasks often exhibit bias in the form of disparities in performance between
subgroups. Since not all sources of biases in real-world medical imaging data
are easily identifiable, it is challenging to comprehensively assess how those
biases are encoded in models, and how capable bias mitigation methods are at
ameliorating performance disparities. In this article, we introduce a novel
analysis framework for systematically and objectively investigating the impact
of biases in medical images on AI models. We developed and tested this
framework for conducting controlled in silico trials to assess bias in medical
imaging AI using a tool for generating synthetic magnetic resonance images with
known disease effects and sources of bias. The feasibility is showcased by
using three counterfactual bias scenarios to measure the impact of simulated
bias effects on a convolutional neural network (CNN) classifier and the
efficacy of three bias mitigation strategies. The analysis revealed that the
simulated biases resulted in expected subgroup performance disparities when the
CNN was trained on the synthetic datasets. Moreover, reweighing was identified
as the most successful bias mitigation strategy for this setup, and we
demonstrated how explainable AI methods can aid in investigating the
manifestation of bias in the model using this framework. Developing fair AI
models is a considerable challenge given that many and often unknown sources of
biases can be present in medical imaging datasets. In this work, we present a
novel methodology to objectively study the impact of biases and mitigation
strategies on deep learning pipelines, which can support the development of
clinical AI that is robust and responsible.
comment: Published in the Journal of the American Medical Informatics
Association
♻ ☆ Evaluation of Deep Learning Semantic Segmentation for Land Cover Mapping on Multispectral, Hyperspectral and High Spatial Aerial Imagery
With the rise of climate change, land cover mapping has become an urgent need
in environmental monitoring. The accuracy of land cover classification
increasingly depends on improvements in remote sensing data. Land cover
classification using satellite imagery has been explored and has become more
prevalent in recent years, but existing methodologies still suffer from
drawbacks such as subjectivity and time consumption. Some deep learning
techniques have been utilized to overcome these limitations. However, most
studies evaluated algorithms for land cover mapping on only one image type.
Therefore, our study conducted deep learning semantic segmentation on
multispectral, hyperspectral, and high spatial resolution aerial image datasets
for land cover mapping. This research implemented semantic segmentation
methods, namely U-Net, LinkNet, FPN, and PSPNet, for categorizing vegetation,
water, and others (i.e., soil and impervious surface). The LinkNet model
obtained a high IoU (Intersection over Union) of 0.92 on all datasets, which is
comparable with the other techniques mentioned. In the evaluation across image
types, the multispectral images showed higher performance, with an IoU and
F1-score of 0.993 and 0.997, respectively. Our results highlight the efficiency
and broad applicability of LinkNet and multispectral imagery for land cover
classification. This research contributes an open-source approach to land cover
segmentation for long-term future applications.
comment: conference, This preprint is based on the following published
conference article: Panuntun, I. A., Chen, Y.-N., Jamaluddin, I., & Tran, T.
L. C., 2023. Evaluation of Deep Learning Semantic Segmentation for Land Cover
Mapping on Multispectral, Hyperspectral and High Spatial Aerial Imagery. 44th
Asian Conference on Remote Sensing, ACRS 2023. Code 198676
♻ ☆ Bytes Are All You Need: Transformers Operating Directly On File Bytes
Modern deep learning approaches usually utilize modality-specific processing.
For example, the most common deep learning approach to image classification
involves decoding image file bytes into an RGB tensor which is passed into a
neural network. Instead, we investigate modality-independent representation
learning by performing classification directly on file bytes, without the need
for decoding files at inference time. This enables models to operate on various
modalities without any hand-designed, modality-specific processing. Our model,
ByteFormer, improves ImageNet Top-1 classification accuracy by $5\%$ (from
$72.2\%$ to $77.33\%$) relative to DeiT models of similar size. Compared to
Perceiver IO, our model requires absolutely no modality-specific processing at
inference time, and uses an order of magnitude fewer parameters at equivalent
accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can
perform audio classification without modifications or modality-specific
preprocessing. We achieve $95.42\%$ classification accuracy on the Speech
Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$).
Additionally, we demonstrate that ByteFormer can operate jointly on images and
audio, handling joint classification without explicit knowledge of the input
modality. We release our code at
https://github.com/apple/corenet/tree/main/projects/byteformer.
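The core idea, classifying raw file bytes with no modality-specific decoding, can be illustrated with a toy PyTorch model. This is a conceptual sketch, not Apple's ByteFormer implementation; the embedding dimension, depth, pooling, and sequence truncation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    """Toy transformer that classifies raw file bytes (ByteFormer-style idea)."""

    def __init__(self, num_classes, dim=192, depth=4, heads=3, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(256, dim)          # one embedding per byte value
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_ids):                     # (B, L) ints in [0, 255]
        x = self.embed(byte_ids) + self.pos[:, : byte_ids.shape[1]]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))              # mean-pool then classify

# A file is consumed without decoding: read its bytes and feed them in directly.
with open(__file__, "rb") as f:
    data = torch.tensor(list(f.read()[:1024]), dtype=torch.long).unsqueeze(0)
logits = ByteClassifier(num_classes=10)(data)
```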
♻ ☆ A Geometric Algorithm for Tubular Shape Reconstruction from Skeletal Representation
We introduce a novel approach for the reconstruction of tubular shapes from
skeletal representations. Our method processes all skeletal points as a whole,
eliminating the need for splitting input structure into multiple segments. We
represent the tubular shape as a truncated signed distance function (TSDF) in a
voxel hashing manner, in which the signed distance between a voxel center and
the object is computed through a simple geometric algorithm. Our method does
not involve any surface sampling scheme or solving large matrix equations, and
therefore is a faster and more elegant solution for tubular shape
reconstruction compared to other approaches. Experiments demonstrate the
efficiency and effectiveness of the proposed method. Code is available at
https://github.com/wlsdzyzl/Dragon.
comment: 9 pages (without reference), 6 figures
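The geometric core, a signed distance from a voxel center to a tube defined by skeletal points with per-point radii, can be sketched directly. This is a simplified per-voxel illustration (distance to the nearest skeletal segment minus an interpolated radius, then truncated); the paper's voxel-hashing pipeline is more elaborate, and the function names here are illustrative.

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Distance from point p to segment ab, plus the interpolation parameter t."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab)), t

def tubular_tsdf(voxel_center, skel_pts, skel_radii, trunc=2.0):
    """Truncated signed distance from a voxel center to a tube around a skeleton.

    skel_pts:   (N, 3) ordered skeletal points.
    skel_radii: (N,) tube radius at each skeletal point.
    Negative inside the tube, positive outside, clipped to [-trunc, trunc].
    """
    best = np.inf
    for i in range(len(skel_pts) - 1):
        d, t = point_segment_dist(voxel_center, skel_pts[i], skel_pts[i + 1])
        r = (1 - t) * skel_radii[i] + t * skel_radii[i + 1]  # interpolated radius
        best = min(best, d - r)                              # signed distance to tube surface
    return float(np.clip(best, -trunc, trunc))

# Example: a straight tube of radius 1 along the x-axis.
skel = np.stack([np.linspace(0, 10, 11), np.zeros(11), np.zeros(11)], axis=1)
print(tubular_tsdf(np.array([5.0, 0.5, 0.0]), skel, np.ones(11)))  # inside -> negative
```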
♻ ☆ Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models UAI 2024
For downstream applications of vision-language pre-trained models, there has
been significant interest in constructing effective prompts. Existing works on
prompt engineering, which either require laborious manual designs or optimize
the prompt tuning as a point estimation problem, may fail to describe diverse
characteristics of categories and limit their applications. We introduce a
Bayesian probabilistic resolution to prompt tuning, where the label-specific
stochastic prompts are generated hierarchically by first sampling a latent
vector from an underlying distribution and then employing a lightweight
generative model. Importantly, we semantically regularize the tuning process by
minimizing the statistical distance between the visual patches and linguistic
prompts, which pushes the stochastic label representations to faithfully
capture diverse visual concepts, instead of overfitting the training
categories. We evaluate the effectiveness of our approach on four tasks:
few-shot image recognition, base-to-new generalization, dataset transfer
learning, and domain shifts. Extensive results over 15 datasets show promising
transferability and generalization performance of our proposed model, both
quantitatively and qualitatively.
comment: Accepted by UAI 2024
♻ ☆ Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts
Recent successes suggest that parameter-efficient fine-tuning of foundation
models is emerging as the state-of-the-art method for transfer learning in
vision, replacing the rich literature of alternatives such as meta-learning. In trying
to harness the best of both worlds, meta-tuning introduces a subsequent
optimization stage of foundation models but has so far only shown limited
success and crucially tends to underperform on out-of-distribution (OOD) tasks.
In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by
sparse mixture-of-experts approaches and trained to isolate subsets of
pre-trained parameters automatically for meta-tuning on each task. SMAT
successfully overcomes OOD sensitivity and delivers on the promise of enhancing
the transfer abilities of vision foundation models beyond parameter-efficient
fine-tuning. We establish new state-of-the-art results on a challenging
combination of Meta-Dataset augmented with additional OOD tasks in both
zero-shot and gradient-based adaptation settings. In addition, we provide a
thorough analysis of the superiority of learned over hand-designed sparsity
patterns for sparse expert methods and the pivotal importance of the sparsity
level in balancing between in-distribution and out-of-distribution
generalization. Our code is publicly available.
comment: The Forty-first International Conference on Machine Learning, 2024
♻ ☆ An Efficient Instance Segmentation Framework Based on Oriented Bounding Boxes
Instance segmentation of completely occluded objects and of dense objects in
robot vision measurement poses two challenging tasks. To deal with both
uniformly, this paper proposes a unified coarse-to-fine instance segmentation
framework, CFNet, which uses box prompt-based segmentation foundation models
(BSMs), e.g., Segment Anything Model. Specifically, CFNet first detects
oriented bounding boxes (OBBs) to distinguish instances and provide coarse
localization information. Then, it predicts OBB prompt-related masks for fine
segmentation. CFNet performs instance segmentation with OBBs that only contain
partial object boundaries on occluders to predict occluded object instances,
which overcomes the difficulty of existing amodal instance segmentation methods
in directly predicting occluded objects. In addition, since OBBs only serve as
prompts, CFNet alleviates the over-dependence on bounding box detection
performance of current instance segmentation methods using OBBs for dense
objects. Moreover, to enable BSMs to handle OBB prompts, we propose a novel OBB
prompt encoder. To make CFNet more lightweight, we perform knowledge
distillation on it and introduce a Gaussian label smoothing method for teacher
model outputs. Experiments demonstrate that CFNet outperforms current instance
segmentation methods on both industrial and public datasets. The code is
available at https://github.com/zhen6618/OBBInstanceSegmentation.
♻ ☆ DreamPBR: Text-driven Generation of High-resolution SVBRDF with Multi-modal Guidance
Prior material creation methods had limitations in producing diverse results
mainly because reconstruction-based methods relied on real-world measurements
and generation-based methods were trained on relatively small material
datasets. To address these challenges, we propose DreamPBR, a novel
diffusion-based generative framework designed to create spatially-varying
appearance properties guided by text and multi-modal controls, providing high
controllability and diversity in material generation. Key to achieving diverse
and high-quality PBR material generation lies in integrating the capabilities
of recent large-scale vision-language models trained on billions of text-image
pairs, along with material priors derived from hundreds of PBR material
samples. We utilize a novel material Latent Diffusion Model (LDM) to establish
the mapping between albedo maps and the corresponding latent space. The latent
representation is then decoded into full SVBRDF parameter maps using a
rendering-aware PBR decoder. Our method supports tileable generation through
convolution with circular padding. Furthermore, we introduce a multi-modal
guidance module, which includes pixel-aligned guidance, style image guidance,
and 3D shape guidance, to enhance the control capabilities of the material LDM.
We demonstrate the effectiveness of DreamPBR in material creation, showcasing
its versatility and user-friendliness on a wide range of controllable
generation and editing applications.
comment: 16 pages, 17 figures
♻ ☆ Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
In the realm of large vision language models (LVLMs), jailbreak attacks serve
as a red-teaming approach to bypass guardrails and uncover safety implications.
Existing jailbreaks predominantly focus on the visual modality, perturbing
solely visual inputs in the prompt for attacks. However, they fall short when
confronted with aligned models that fuse visual and textual features
simultaneously for generation. To address this limitation, this paper
introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes
jailbreaks by optimizing textual and visual prompts cohesively. Initially, we
adversarially embed universally harmful perturbations in an image, guided by a
few-shot query-agnostic corpus (e.g., affirmative prefixes and negative
inhibitions). This process ensures that the image prompts the LVLM to respond
positively to any harmful query. Subsequently, leveraging the adversarial
image, we optimize textual prompts with specific harmful intent. In particular,
we utilize a large language model to analyze jailbreak failures and employ
chain-of-thought reasoning to refine textual prompts through a
feedback-iteration manner. To validate the efficacy of our approach, we
conducted extensive evaluations on various datasets and LVLMs, demonstrating
that our method significantly outperforms other methods by large margins
(+29.03% in attack success rate on average). Additionally, we showcase the
potential of our attacks on black-box commercial LVLMs, such as Gemini and
ChatGLM.
♻ ☆ Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture
4D head capture aims to generate dynamic topological meshes and corresponding
texture maps from videos, which is widely utilized in movies and games for its
ability to simulate facial muscle movements and recover dynamic textures in
pore-squeezing. The industry often adopts the method involving multi-view
stereo and non-rigid alignment. However, this approach is prone to errors and
heavily reliant on time-consuming manual processing by artists. To simplify
this process, we propose Topo4D, a novel framework for automatic geometry and
texture generation, which optimizes densely aligned 4D heads and 8K texture
maps directly from calibrated multi-view time-series images. Specifically, we
first represent the time-series faces as a set of dynamic 3D Gaussians with
fixed topology in which the Gaussian centers are bound to the mesh vertices.
Afterward, we perform alternating geometry and texture optimization frame by
frame for high-quality geometry and texture learning while maintaining temporal
topology stability. Finally, we can extract dynamic facial meshes with a
regular wiring arrangement and high-fidelity textures with pore-level details
from the learned Gaussians. Extensive experiments show that our method achieves
superior results compared to current SOTA face reconstruction methods both in the
quality of meshes and textures. Project page:
https://xuanchenli.github.io/Topo4D/.
♻ ☆ Instruction-Guided Scene Text Recognition
Multi-modal models show appealing performance in visual recognition tasks
recently, as free-form text-guided training evokes the ability to understand
fine-grained visual content. However, current models are either inefficient or
cannot be trivially upgraded to scene text recognition (STR) due to the
composition difference between natural and text images. We propose a novel
instruction-guided scene text recognition (IGTR) paradigm that formulates STR
as an instruction learning problem and understands text images by predicting
character attributes, e.g., character frequency, position, etc. IGTR first
devises $\left \langle condition,question,answer\right \rangle$ instruction
triplets, providing rich and diverse descriptions of character attributes. To
effectively learn these attributes through question-answering, IGTR develops
a lightweight instruction encoder, a cross-modal feature fusion module, and a
multi-task answer head, which together guide nuanced text image understanding.
Furthermore, IGTR realizes different recognition pipelines simply by using
different instructions, enabling a character-understanding-based text reasoning
paradigm that considerably differs from current methods. Experiments on English
and Chinese benchmarks show that IGTR outperforms existing models by
significant margins, while maintaining a small model size and efficient
inference speed. Moreover, by adjusting the sampling of instructions, IGTR
offers an elegant way to tackle the recognition of both rarely appearing and
morphologically similar characters, which were previous challenges. Code at
\href{https://github.com/Topdu/OpenOCR}{this http URL}.
♻ ☆ Local-Aware Global Attention Network for Person Re-Identification Based on Body and Hand Images
Learning representative, robust and discriminative information from images is
essential for effective person re-identification (Re-Id). In this paper, we
propose a compound approach for end-to-end discriminative deep feature learning
for person Re-Id based on both body and hand images. We carefully design the
Local-Aware Global Attention Network (LAGA-Net), a multi-branch deep network
architecture consisting of one branch for spatial attention, one branch for
channel attention, one branch for global feature representations and another
branch for local feature representations. The attention branches focus on the
relevant features of the image while suppressing the irrelevant backgrounds. In
order to overcome the weakness of attention mechanisms, which are equivariant
to pixel shuffling, we integrate relative positional encodings into the spatial
attention module to capture the spatial positions of pixels. The global branch
intends to preserve the global context or structural information. For the
local branch, which intends to capture fine-grained information, we perform
uniform partitioning to generate stripes on the conv-layer horizontally. We
retrieve the parts by conducting a soft partition without explicitly
partitioning the images or requiring external cues such as pose estimation. A
set of ablation studies shows that each component contributes to the increased
performance of the LAGA-Net. Extensive evaluations on four popular body-based
person Re-Id benchmarks and two publicly available hand datasets demonstrate
that our proposed method consistently outperforms existing state-of-the-art
methods.
comment: arXiv admin note: substantial text overlap with arXiv:2108.02234
♻ ☆ CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation
The age estimation task aims to predict the age of an individual by analyzing
facial features in an image. The development of age estimation can improve the
efficiency and accuracy of various applications (e.g., age verification and
secure access control, etc.). In recent years, contrastive language-image
pre-training (CLIP) has been widely used in various multimodal tasks and has
made some progress in the field of age estimation. However, existing CLIP-based
age estimation methods require high memory usage (quadratic complexity) when
globally modeling images, and lack an error feedback mechanism to prompt the
model about the quality of age prediction results. To tackle the above issues,
we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age
Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to
extract image features and text semantic information respectively, and map them
into a highly semantically aligned high-dimensional feature space. Next, we
design a new Transformer architecture (i.e., FourierFormer) to achieve
channel evolution and spatial interaction of images, and to fuse image and text
semantic information. Compared with the quadratic complexity of the attention
mechanism, the proposed FourierFormer has linear-logarithmic complexity. To further
narrow the semantic gap between image and text features, we utilize an
efficient contrastive multimodal learning module that supervises the multimodal
fusion process of FourierFormer through contrastive loss for image-text
matching, thereby improving the interaction effect between different
modalities. Finally, we introduce reversible age estimation, which uses
end-to-end error feedback to reduce the error rate of age predictions. Through
extensive experiments on multiple data sets, CILF-CIAE has achieved better age
prediction results.
comment: 14 pages, 14 figures, 3 tables
♻ ☆ WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising MICCAI2024
In clinical examinations and diagnoses, low-dose computed tomography (LDCT)
is crucial for minimizing health risks compared with normal-dose computed
tomography (NDCT). However, reducing the radiation dose compromises the
signal-to-noise ratio, leading to degraded quality of CT images. To address
this, we analyze the LDCT denoising task based on experimental results from the
frequency perspective, and then introduce a novel self-supervised CT image
denoising method called WIA-LD2ND, only using NDCT data. The proposed WIA-LD2ND
comprises two modules: Wavelet-based Image Alignment (WIA) and Frequency-Aware
Multi-scale Loss (FAM). First, WIA is introduced to align NDCT with LDCT by
mainly adding noise to the high-frequency components, which is the main
difference between LDCT and NDCT. Second, to better capture high-frequency
components and detailed information, Frequency-Aware Multi-scale Loss (FAM) is
proposed by effectively utilizing multi-scale feature space. Extensive
experiments on two public LDCT denoising datasets demonstrate that our
WIA-LD2ND, which uses only NDCT data, outperforms several existing
state-of-the-art weakly-supervised and self-supervised methods. Source code is available at
https://github.com/zhaohaoyu376/WI-LD2ND.
comment: MICCAI2024
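The alignment idea, corrupting only the high-frequency content of an NDCT image so it resembles LDCT, can be sketched with a single wavelet decomposition. This is a conceptual illustration of the high-frequency-noise idea, not the exact WIA module; it assumes the PyWavelets (pywt) package, and the noise scale and wavelet choice are illustrative.

```python
import numpy as np
import pywt

def wavelet_align_ndct(ndct_img, noise_sigma=0.05, wavelet="haar"):
    """Pseudo-align an NDCT slice with LDCT by corrupting its high-frequency subbands.

    The low-frequency approximation is kept intact; Gaussian noise is added only
    to the detail (high-frequency) coefficients, where LDCT and NDCT mainly differ.
    """
    cA, (cH, cV, cD) = pywt.dwt2(ndct_img, wavelet)
    noisy_details = tuple(
        c + np.random.randn(*c.shape) * noise_sigma * np.abs(c).max()
        for c in (cH, cV, cD)
    )
    return pywt.idwt2((cA, noisy_details), wavelet)

# Example: synthesize a pseudo-LDCT training input from a clean NDCT slice.
ndct = np.random.rand(256, 256).astype(np.float32)
pseudo_ldct = wavelet_align_ndct(ndct)
```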
♻ ☆ MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation MICCAI2024
The task of single-source domain generalization (SDG) in medical image
segmentation is crucial due to frequent domain shifts in clinical image
datasets. To address the challenge of poor generalization across different
domains, we introduce a Plug-and-Play module for data augmentation called
MoreStyle. MoreStyle diversifies image styles by relaxing low-frequency
constraints in Fourier space, guiding the image reconstruction network. With
the help of adversarial learning, MoreStyle further expands the style range and
pinpoints the most intricate style combinations within latent features. To
handle significant style variations, we introduce an uncertainty-weighted loss.
This loss emphasizes hard-to-classify pixels resulting only from style shifts
while mitigating true hard-to-classify pixels in both MoreStyle-generated and
original images. Extensive experiments on two widely used benchmarks
demonstrate that the proposed MoreStyle effectively helps to achieve good
domain generalization ability, and has the potential to further boost the
performance of some state-of-the-art SDG methods. Source code is available at
https://github.com/zhaohaoyu376/morestyle.
comment: MICCAI2024
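Relaxing the low-frequency constraint in Fourier space amounts to letting the low-frequency amplitude spectrum vary while keeping phase (and hence structure) fixed. The sketch below illustrates this with a random amplitude perturbation; the actual MoreStyle module learns the perturbation through an image reconstruction network and adversarial training, and the band fraction `beta` and `strength` here are illustrative assumptions.

```python
import numpy as np

def relax_low_freq(img, beta=0.1, strength=0.5):
    """Perturb the low-frequency amplitude spectrum of an image to diversify style.

    beta:     fraction of the spectrum around DC treated as "low frequency".
    strength: maximum relative perturbation applied to those amplitudes.
    """
    fft = np.fft.fftshift(np.fft.fft2(img))
    amp, phase = np.abs(fft), np.angle(fft)

    h, w = img.shape
    ch, cw = h // 2, w // 2
    bh, bw = max(1, int(h * beta)), max(1, int(w * beta))

    # Multiply the central (low-frequency) amplitudes by random factors.
    factors = 1.0 + (np.random.rand(2 * bh, 2 * bw) - 0.5) * 2 * strength
    amp[ch - bh:ch + bh, cw - bw:cw + bw] *= factors

    # Recombine perturbed amplitude with the original phase.
    styled = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * phase)))
    return np.real(styled)

augmented = relax_low_freq(np.random.rand(224, 224))
```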
♻ ☆ Recovering the Pre-Fine-Tuning Weights of Generative Models ICML 2024
The dominant paradigm in generative modeling consists of two steps: i)
pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained
model with human values via fine-tuning. This practice is considered safe, as
no current method can recover the unsafe, pre-fine-tuning model weights. In
this paper, we demonstrate that this assumption is often false. Concretely, we
present Spectral DeTuning, a method that can recover the weights of the
pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In
contrast to previous attacks that attempt to recover pre-fine-tuning
capabilities, our method aims to recover the exact pre-fine-tuning weights. Our
approach exploits this new vulnerability against large-scale models such as a
personalized Stable Diffusion and an aligned Mistral.
comment: ICML 2024. Project page: https://vision.huji.ac.il/spectral_detuning/
♻ ☆ Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Generalization is a main issue for current audio deepfake detectors, which
struggle to provide reliable results on out-of-distribution data. Given the
speed at which more and more accurate synthesis methods are developed, it is
very important to design techniques that also work well on data they were not
trained for. In this paper, we study the potential of large-scale pre-trained
models for audio deepfake detection, with special focus on generalization
ability. To this end, the detection problem is reformulated in a speaker
verification framework and fake audios are exposed by the mismatch between the
voice sample under test and the voice of the claimed identity. With this
paradigm, no fake speech sample is necessary in training, cutting off any link
with the generation method at the root, and ensuring full generalization
ability. Features are extracted by general-purpose large pre-trained models,
with no need for training or fine-tuning on specific fake detection or speaker
verification datasets. At detection time only a limited set of voice fragments
of the identity under test is required. Experiments on several datasets
widespread in the community show that detectors based on pre-trained models
achieve excellent performance and show strong generalization ability, rivaling
supervised methods on in-distribution data and largely overcoming them on
out-of-distribution data.
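The verification-style decision rule described above can be sketched in a few lines: a sample is flagged as fake when its embedding does not match the claimed identity's genuine voice fragments. The embeddings would come from a large pre-trained speaker encoder; the threshold and mean-score fusion below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def is_fake(test_embedding, enrolled_embeddings, threshold=0.6):
    """Flag a voice sample as fake if it does not match the claimed identity.

    test_embedding:      (D,) speaker embedding of the sample under test.
    enrolled_embeddings: (K, D) embeddings of a few genuine voice fragments of
                         the claimed identity.
    No fake speech is needed anywhere: the decision is a pure verification test.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = [cos(test_embedding, e) for e in enrolled_embeddings]
    return float(np.mean(scores)) < threshold  # low similarity -> mismatch -> fake

# Example with random vectors standing in for pre-trained speaker embeddings.
enrolled = np.random.randn(5, 256)
print(is_fake(np.random.randn(256), enrolled))
```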
♻ ☆ Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking
Empowered by transformer-based models, visual tracking has advanced
significantly. However, the slow speed of current trackers limits their
applicability on devices with constrained computational resources. To address
this challenge, we introduce ABTrack, an adaptive computation framework that
adaptively bypasses transformer blocks for efficient visual tracking. The
rationale behind ABTrack is rooted in the observation that semantic features or
relations do not uniformly impact the tracking task across all abstraction
levels. Instead, this impact varies based on the characteristics of the target
and the scene it occupies. Consequently, disregarding insignificant semantic
features or relations at certain abstraction levels may not significantly
affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to
determine if a transformer block should be bypassed, which adaptively
simplifies the architecture of ViTs and thus speeds up the inference process.
To counteract the time cost incurred by the BDMs and further enhance the
efficiency of ViTs, we introduce a novel ViT pruning method to reduce the
dimension of the latent representation of tokens in each transformer block.
Extensive experiments on multiple tracking benchmarks validate the
effectiveness and generality of the proposed method and show that it achieves
state-of-the-art performance. Code is released at:
https://github.com/xyyang317/ABTrack.
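The bypass idea can be illustrated by wrapping a ViT block with a tiny gating head that decides, per input, whether the block may be skipped. This is a conceptual sketch rather than the authors' Bypass Decision Module; the gate architecture, mean-token pooling, and threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BypassBlock(nn.Module):
    """A transformer block wrapped with a bypass-decision gate (ABTrack-style idea)."""

    def __init__(self, block, dim, threshold=0.5):
        super().__init__()
        self.block = block
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, tokens):                      # (B, N, D)
        keep_prob = self.gate(tokens.mean(dim=1))   # one keep-probability per sample
        if not self.training and (keep_prob < self.threshold).all():
            return tokens                           # bypass: identity mapping
        return self.block(tokens)

# Example: wrap a stand-in transformer block and run it on dummy tokens.
dim = 256
block = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=512, batch_first=True)
wrapped = BypassBlock(block, dim).eval()
out = wrapped(torch.randn(2, 196, dim))
```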
♻ ☆ AdaCL: Adaptive Continual Learning
Class-Incremental Learning aims to update a deep classifier to learn new
categories while maintaining or improving its accuracy on previously observed
classes. Common methods to prevent forgetting previously learned classes
include regularizing the neural network updates and storing exemplars in
memory, which come with hyperparameters such as the learning rate,
regularization strength, or the number of exemplars. However, these
hyperparameters are usually only tuned at the start and then kept fixed
throughout the learning sessions, ignoring the fact that newly encountered
tasks may have varying levels of novelty or difficulty. This study investigates
the necessity of hyperparameter `adaptivity' in Class-Incremental Learning: the
ability to dynamically adjust hyperparameters such as the learning rate,
regularization strength, and memory size according to the properties of the new
task at hand. We propose AdaCL, a Bayesian Optimization-based approach to
automatically and efficiently determine the optimal values for those parameters
with each learning task. We show that adapting hyperparameters for each new
task leads to improvements in accuracy, forgetting, and memory usage. Code is available
at https://github.com/ElifCerenGokYildirim/AdaCL.
comment: Published in 1st ContinualAI Unconference
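Per-task hyperparameter adaptation via Bayesian optimization can be sketched with an off-the-shelf optimizer. The example below uses Optuna's default TPE sampler as a stand-in for the paper's Bayesian optimizer; `evaluate_task` is a placeholder for the actual incremental trainer, and the search ranges are illustrative.

```python
import optuna

def evaluate_task(lr, reg_strength, memory_size):
    """Placeholder: train on the current task with these hyperparameters and
    return validation accuracy. Replace with the real incremental trainer."""
    return 1.0 - abs(lr - 1e-3) - abs(reg_strength - 5.0) / 100 - abs(memory_size - 2000) / 1e5

def tune_for_new_task(n_trials=20):
    """Re-tune hyperparameters for each incoming task (AdaCL-style adaptivity)."""
    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        reg = trial.suggest_float("reg_strength", 0.1, 100.0, log=True)
        mem = trial.suggest_int("memory_size", 500, 5000, step=500)
        return evaluate_task(lr, reg, mem)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params

# Called once per incoming task, so the learning rate, regularization strength,
# and memory budget adapt to that task's novelty or difficulty.
best = tune_for_new_task()
```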
♻ ☆ Woven Fabric Capture with a Reflection-Transmission Photo Pair SIGGRAPH 2024
Digitizing woven fabrics would be valuable for many applications, from
digital humans to interior design. Previous work introduces a lightweight woven
fabric acquisition approach by capturing a single reflection image and
estimating the fabric parameters with a differentiable geometric and shading
model. The renderings of the estimated fabric parameters can closely match the
photo; however, the captured reflection image is insufficient to fully
characterize the fabric sample reflectance. For instance, fabrics with
different thicknesses might have similar reflection images but lead to
significantly different transmission. We propose to recover the woven fabric
parameters from two captured images: reflection and transmission. At the core
of our method is a differentiable bidirectional scattering distribution
function (BSDF) model, handling reflection and transmission, including single
and multiple scattering. We propose a two-layer model, where the single
scattering uses an SGGX phase function as in previous work, and multiple
scattering uses a new azimuthally-invariant microflake definition, which we
term ASGGX. This new fabric BSDF model closely matches real woven fabrics in
both reflection and transmission. We use a simple setup for capturing
reflection and transmission photos with a cell phone camera and two point
lights, and estimate the fabric parameters via a lightweight network, together
with a differentiable optimization. We also model the out-of-focus effects
explicitly with a simple solution to match the thin-lens camera better. As a
result, the renderings of the estimated parameters can agree with the input
images on both reflection and transmission for the first time. The code for
this paper is at https://github.com/lxtyin/FabricBTDF-Recovery.
comment: 10 pages, 16 figures (in the main paper). Accepted by SIGGRAPH 2024
conference
♻ ☆ Towards Robust Physical-world Backdoor Attacks on Lane Detection
Deep learning-based lane detection (LD) plays a critical role in autonomous
driving systems, such as adaptive cruise control. However, it is vulnerable to
backdoor attacks. Existing backdoor attack methods on LD exhibit limited
effectiveness in dynamic real-world scenarios, primarily because they fail to
consider dynamic scene factors, including changes in driving perspectives
(e.g., viewpoint transformations) and environmental conditions (e.g., weather
or lighting changes). To tackle this issue, this paper introduces BadLANE, a
dynamic scene adaptation backdoor attack for LD designed to withstand changes
in real-world dynamic scene factors. To address the challenges posed by
changing driving perspectives, we propose an amorphous trigger pattern composed
of shapeless pixels. This trigger design allows the backdoor to be activated by
various forms or shapes of mud spots or pollution on the road or lens, enabling
adaptation to changes in vehicle observation viewpoints during driving. To
mitigate the effects of environmental changes, we design a meta-learning
framework to train meta-generators tailored to different environmental
conditions. These generators produce meta-triggers that incorporate diverse
environmental information, such as weather or lighting conditions, as the
initialization of the trigger patterns for backdoor implantation, thus enabling
adaptation to dynamic environments. Extensive experiments on various commonly
used LD models in both digital and physical domains validate the effectiveness
of our attacks, outperforming other baselines significantly (+25.15% on average
in Attack Success Rate). Our codes will be available upon paper publication.
♻ ☆ Training-Free Acceleration of ViTs with Delayed Spatial Merging ICML 2024
Token merging has emerged as a new paradigm that can accelerate the inference
of Vision Transformers (ViTs) without any retraining or fine-tuning. To push
the frontier of training-free acceleration in ViTs, we improve token merging by
adding the perspectives of 1) activation outliers and 2) hierarchical
representations. Through a careful analysis of the attention behavior in ViTs,
we characterize a delayed onset of the convergent attention phenomenon, which
makes token merging undesirable in the bottom blocks of ViTs. Moreover, we
augment token merging with a hierarchical processing scheme to capture
multi-scale redundancy between visual tokens. Combining these two insights, we
build a unified inference framework called DSM: Delayed Spatial Merging. We
extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks
(ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction
and 1.6$\times$ throughput speedup at a negligible loss while being two orders
of magnitude faster than existing methods.
comment: ICML 2024 ES-FoMo Workshop
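The "delayed" part of the method, applying token merging only after the bottom blocks, can be illustrated with a simplified bipartite-matching merge. This sketch is not the DSM implementation (it omits hierarchical processing and outlier handling); the merge routine, delay index, and merge count `r` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Merge the r most similar token pairs by averaging (simplified ToMe-style step).

    x: (B, N, D) tokens, split into alternating sets A/B. Each A token is matched to
    its most similar B token; the r best-matched A tokens are folded into their B
    partners and dropped, reducing N by r.
    """
    a, b = x[:, ::2].clone(), x[:, 1::2].clone()
    sim = torch.einsum("bnd,bmd->bnm", F.normalize(a, dim=-1), F.normalize(b, dim=-1))
    best_sim, best_idx = sim.max(dim=-1)               # best B partner per A token
    order = best_sim.argsort(dim=-1, descending=True)  # most similar A tokens first
    out = []
    for bi in range(x.shape[0]):
        drop = torch.zeros(a.shape[1], dtype=torch.bool)
        for ai in order[bi, :r]:
            bj = best_idx[bi, ai]
            b[bi, bj] = (b[bi, bj] + a[bi, ai]) / 2    # fold A token into its partner
            drop[ai] = True
        out.append(torch.cat([a[bi][~drop], b[bi]], dim=0))
    return torch.stack(out)                            # (B, N - r, D)

def forward_with_delayed_merging(blocks, x, delay=4, r=16):
    """Apply token merging only after `delay` blocks (the delayed-onset idea)."""
    for i, block in enumerate(blocks):
        x = block(x)
        if i >= delay:                                  # no merging in bottom blocks
            x = merge_tokens(x, r)
    return x

# Example with stand-in transformer blocks: tokens shrink only after block 4.
dim = 192
blocks = [torch.nn.TransformerEncoderLayer(dim, 3, 384, batch_first=True) for _ in range(8)]
out = forward_with_delayed_merging(blocks, torch.randn(2, 196, dim))
```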
♻ ☆ Multimodal Learning With Intraoperative CBCT & Variably Aligned Preoperative CT Data To Improve Segmentation MICCAI
Cone-beam computed tomography (CBCT) is an important tool facilitating
computer aided interventions, despite often suffering from artifacts that pose
challenges for accurate interpretation. While the degraded image quality can
affect downstream segmentation, the availability of high quality, preoperative
scans represents potential for improvements. Here we consider a setting where
preoperative CT and intraoperative CBCT scans are available, however, the
alignment (registration) between the scans is imperfect. We propose a
multimodal learning method that fuses roughly aligned CBCT and CT scans and
investigate the effect of CBCT quality and misalignment on the final
segmentation performance. For that purpose, we make use of a synthetically
generated data set containing real CT and synthetic CBCT volumes. As an
application scenario, we focus on liver and liver tumor segmentation. We show
that the fusion of preoperative CT and simulated, intraoperative CBCT mostly
improves segmentation performance (compared to using intraoperative CBCT only)
and that even clearly misaligned preoperative data has the potential to improve
segmentation performance.
comment: Submitted to SASHIMI2024 (MICCAI workshop)
♻ ☆ Fuzzy Attention-based Border Rendering Network for Lung Organ Segmentation MICCAI 2024
Automatic lung organ segmentation on CT images is crucial for lung disease
diagnosis. However, the unlimited voxel values and class imbalance of lung
organs can lead to false-negative/positive and leakage issues in advanced
methods. Additionally, some slender lung organs are easily lost during the
recycled down/up-sample procedure, e.g., bronchioles & arterioles, causing
severe discontinuity issues. Inspired by these observations, this paper
introduces an effective lung organ segmentation method called the Fuzzy
Attention-based Border Rendering (FABR) network. Since fuzzy logic can handle
the uncertainty in feature extraction, the fusion of deep networks and fuzzy
sets should be a viable solution for better performance. Meanwhile, unlike
prior top-tier methods that operate on all regular dense points, our FABR
depicts lung organ regions as cube-trees, focusing only on recycle-sampled
border-vulnerable points, and renders the severely discontinuous,
false-negative/positive organ regions with a novel Global-Local Cube-tree
Fusion (GLCF) module. Experimental results on four challenging airway and
artery datasets demonstrate that our method achieves favorable performance.
comment: MICCAI 2024
♻ ☆ Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction
Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani
This paper addresses the problem of predicting hazards that drivers may
encounter while driving a car. We formulate it as a task of anticipating
impending accidents using a single input image captured by car dashcams. Unlike
existing approaches to driving hazard prediction that rely on computational
simulations or anomaly detection from videos, this study focuses on high-level
inference from static images. The problem requires predicting and reasoning about
future events based on uncertain observations, which falls under visual
abductive reasoning. To enable research in this understudied area, a new
dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is
created. The dataset consists of 15K dashcam images of street scenes, and each
image is associated with a tuple containing car speed, a hypothesized hazard
description, and visual entities present in the scene. These are annotated by
human annotators, who identify risky scenes and provide descriptions of
potential accidents that could occur a few seconds later. We present several
baseline methods and evaluate their performance on our dataset, identifying
remaining issues and discussing future directions. This study contributes to
the field by introducing a novel problem formulation and dataset, enabling
researchers to explore the potential of multi-modal AI for driving hazard
prediction.
comment: Main Paper: 11 pages, Supplementary Materials: 25 pages
♻ ☆ PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
Layout generation is the keystone in achieving automated graphic design,
requiring arranging the position and size of various multi-modal design
elements in a visually pleasing and constraint-following manner. Previous
approaches are either inefficient for large-scale applications or lack
flexibility for varying design requirements. Our research introduces a unified
framework for automated graphic layout generation, leveraging the multi-modal
large language model (MLLM) to accommodate diverse design tasks. In contrast,
our data-driven method employs structured text (JSON format) and visual
instruction tuning to generate layouts under specific visual and textual
constraints, including user-defined natural language specifications. We
conducted extensive experiments and achieved state-of-the-art (SOTA)
performance on public multi-modal layout generation benchmarks, demonstrating
the effectiveness of our method. Moreover, recognizing existing datasets'
limitations in capturing the complexity of real-world graphic designs, we
propose two new datasets for much more challenging tasks (user-constrained
generation and complicated poster), further validating our model's utility in
real-life settings. Marked by its superior accessibility and adaptability,
this approach further automates large-scale graphic design tasks. The code and
datasets will be publicly available on
https://github.com/posterllava/PosterLLaVA.
comment: 10 pages; typos corrected, appendix added
♻ ☆ DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks
The assumption of a static environment is common in many geometric computer
vision tasks like SLAM but limits their applicability in highly dynamic scenes.
Since these tasks rely on identifying point correspondences between input
images within the static part of the environment, we propose a graph neural
network-based sparse feature matching network designed to perform robust
matching under challenging conditions while excluding keypoints on moving
objects. We employ a similar scheme of attentional aggregation over graph edges
to enhance keypoint representations as state-of-the-art feature-matching
networks but augment the graph with epipolar and temporal information and
vastly reduce the number of graph edges. Furthermore, we introduce a
self-supervised training scheme to extract pseudo labels for image pairs in
dynamic environments from exclusively unprocessed visual-inertial data. A
series of experiments show the superior performance of our network as it
excludes keypoints on moving objects compared to state-of-the-art feature
matching networks while still achieving similar results regarding conventional
matching metrics. When integrated into a SLAM system, our network significantly
improves performance, especially in highly dynamic scenes.
♻ ☆ E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
Online GUI navigation on mobile devices has drawn a lot of attention in recent
years, since it contributes to many real-world applications. With the rapid
development of large language models (LLMs), multimodal large language models
(MLLMs) have tremendous potential for this task. However, existing MLLMs need
high-quality data to improve their ability to make correct navigation decisions
according to human user inputs. In this paper, we develop a novel and highly
valuable dataset, named \textbf{E-ANT}, the first Chinese GUI navigation
dataset that contains real human behaviour and high-quality annotated
screenshots, comprising nearly 40,000 real human traces over 5000+ different
tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and report
their experimental results with sufficient ablations. We believe
that our proposed dataset will be beneficial for both the evaluation and
development of GUI navigation and LLM/MLLM decision-making capabilities.
comment: 9 pages, 5 figures, Under review
♻ ☆ Training morphological neural networks with gradient descent: some theoretical insights
Morphological neural networks, or layers, can be a powerful tool to boost the
progress in mathematical morphology, either on theoretical aspects such as the
representation of complete lattice operators, or in the development of image
processing pipelines. However, these architectures turn out to be difficult to
train when they count more than a few morphological layers, at least within
popular machine learning frameworks which use gradient descent based
optimization algorithms. In this paper we investigate the potential and
limitations of differentiation based approaches and back-propagation applied to
morphological networks, in light of the non-smooth optimization concept of
Bouligand derivative. We provide insights and first theoretical guidelines, in
particular regarding initialization and learning rates.
♻ ☆ YOLOv10 to Its Genesis: A Decadal and Comprehensive Review of The You Only Look Once Series
Ranjan Sapkota, Rizwan Qureshi, Marco Flores Calero, Chetan Badjugar, Upesh Nepal, Alwin Poulose, Peter Zeno, Uday Bhanu Prakash Vaddevolu, Hong Yan, Manoj Karkee
This review systematically examines the progression of the You Only Look Once
(YOLO) object detection algorithms from YOLOv1 to the recently unveiled
YOLOv10. Employing a reverse chronological analysis, this study examines the
advancements introduced by YOLO algorithms, beginning with YOLOv10 and
progressing through YOLOv9, YOLOv8, and subsequent versions to explore each
version's contributions to enhancing speed, accuracy, and computational
efficiency in real-time object detection. The study highlights the
transformative impact of YOLO across five critical application areas:
automotive safety, healthcare, industrial manufacturing, surveillance, and
agriculture. By detailing the incremental technological advancements in
subsequent YOLO versions, this review chronicles the evolution of YOLO, and
discusses the challenges and limitations of each earlier version. The
evolution signifies a path towards integrating YOLO with multimodal,
context-aware, and General Artificial Intelligence (AGI) systems for the next
YOLO decade, promising significant implications for future developments in
AI-driven applications.
comment: 11 Figures, 7 Tables
♻ ☆ A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Thomas Stegmüller, Tim Lebailly, Nikola Dukic, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran
Zero-shot classification capabilities naturally arise in models trained
within a vision-language contrastive framework. Despite their classification
prowess, these models struggle in dense tasks like zero-shot open-vocabulary
segmentation. This deficiency is often attributed to the absence of
localization cues in captions and the intertwined nature of the learning
process, which encompasses both image representation learning and
cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple
framework for open-vocabulary Zero-Shot Segmentation. The method is founded on
two key principles: i) leveraging frozen vision-only models that exhibit
spatial awareness while exclusively aligning the text encoder and ii)
exploiting the discrete nature of text and linguistic knowledge to pinpoint
local concepts within captions. By capitalizing on the quality of the visual
representations, our method requires only image-caption pair datasets and
adapts to both small curated and large-scale noisy datasets. When trained on
COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out
of 8 benchmark datasets in less than 15 minutes.
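A hedged sketch (not the authors' code) of the core idea stated in the abstract: a spatially aware vision encoder stays frozen, and only the text side is trained so that a concept token from the caption matches the image patches it refers to. The soft pooling and cosine objective below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def concept_alignment_loss(patch_feats, concept_feat, temperature=0.07):
    """patch_feats: (N_patches, D) frozen vision features for one image.
    concept_feat: (D,) trainable text embedding of a noun found in the caption."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    concept_feat = F.normalize(concept_feat, dim=-1)
    sim = patch_feats @ concept_feat              # (N_patches,)
    # Softly pool the patches most similar to the concept (localisation cue).
    weights = torch.softmax(sim / temperature, dim=0)
    pooled = (weights.unsqueeze(-1) * patch_feats).sum(dim=0)
    # Pull the text concept towards its pooled visual support.
    return 1.0 - F.cosine_similarity(pooled, concept_feat, dim=0)
```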
♻ ☆ VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
Video Temporal Grounding (VTG) focuses on accurately identifying event
timestamps within a particular video based on a linguistic query, playing a
vital role in downstream tasks such as video browsing and editing. While Video
Large Language Models (video LLMs) have made significant progress in
understanding video content, they often face challenges in accurately
pinpointing timestamps within videos, which limits their performance on VTG
tasks. Therefore, to improve video LLMs' ability to effectively locate
timestamps, we argue that two critical aspects need to be enhanced. First, it
is essential to have high-quality instruction tuning datasets that encompass
mainstream VTG tasks. Second, directly incorporating timestamp knowledge into
video LLMs is crucial, as it enables models to efficiently comprehend timestamp
information. To address these needs, we first introduce VTG-IT-120K, a
high-quality and comprehensive instruction tuning dataset that covers VTG tasks
such as moment retrieval, dense video captioning, video summarization, and
video highlight detection. Furthermore, we propose a specially designed video
LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp
knowledge into visual tokens; (2) incorporates absolute-time tokens that
specifically handle timestamp knowledge, thereby avoiding concept shifts; and
(3) introduces a lightweight, high-performance slot-based token compression
method to facilitate the sampling of more video frames. Comprehensive
experiments showcase the superior performance of VTG-LLM in comparison to other
video LLM methods across various VTG tasks. Our code and datasets are available
at \url{https://github.com/gyxxyg/VTG-LLM}.
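A hedged sketch (illustrative only, not the released model) of a slot-based token compression module in the spirit described above: a small set of learnable slots cross-attends to the full sequence of frame tokens, so that many sampled frames are summarised into a fixed token budget before being passed to the video LLM. The slot count, attention layout, and naming are assumptions.

```python
import torch
import torch.nn as nn

class SlotCompressor(nn.Module):
    def __init__(self, dim: int, num_slots: int = 64, num_heads: int = 8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, T*P, D) tokens from many sampled frames.
        b = visual_tokens.size(0)
        queries = self.slots.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(queries, visual_tokens, visual_tokens)
        return compressed  # (B, num_slots, D): a fixed-size video summary
```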
♻ ☆ RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing
The recent advancements in deep convolutional neural networks have shown
significant promise in the domain of road scene parsing. Nevertheless, the
existing works focus primarily on freespace detection, with little attention
given to hazardous road defects that could compromise both driving safety and
comfort. In this paper, we introduce RoadFormer, a novel Transformer-based
data-fusion network developed for road scene parsing. RoadFormer utilizes a
duplex encoder architecture to extract heterogeneous features from both RGB
images and surface normal information. The encoded features are subsequently
fed into a novel heterogeneous feature synergy block for effective feature
fusion and recalibration. The pixel decoder then learns multi-scale long-range
dependencies from the fused and recalibrated heterogeneous features, which are
subsequently processed by a Transformer decoder to produce the final semantic
prediction. Additionally, we release SYN-UDTIRI, the first large-scale road
scene parsing dataset that contains over 10,407 RGB images, dense depth images,
and the corresponding pixel-level annotations for both freespace and road
defects of different shapes and sizes. Extensive experimental evaluations
conducted on our SYN-UDTIRI dataset, as well as on three public datasets,
including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer
outperforms all other state-of-the-art networks for road scene parsing.
Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source
code, created dataset, and demo video are publicly available at
mias.group/RoadFormer.
comment: 10 pages, 7 figures. Accepted by IEEE Transactions on Intelligent Vehicles
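A hedged sketch (illustrative, not the released network) of the duplex-encoder idea stated in the abstract: RGB images and surface-normal maps are encoded by two separate branches, and their heterogeneous features are fused before decoding. The plain concatenation-plus-projection below merely stands in for the paper's heterogeneous feature synergy block.

```python
import torch
import torch.nn as nn

class DuplexEncoder(nn.Module):
    def __init__(self, out_channels: int = 256):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
        self.rgb_branch = branch()       # encodes the RGB image
        self.normal_branch = branch()    # encodes the surface-normal map
        self.fuse = nn.Conv2d(256, out_channels, 1)

    def forward(self, rgb: torch.Tensor, normals: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_branch(rgb), self.normal_branch(normals)], dim=1)
        return self.fuse(fused)          # fused features for the pixel decoder

# Usage with dummy inputs.
model = DuplexEncoder()
feats = model(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```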
♻ ☆ 3D Human Mesh Estimation from Virtual Markers CVPR 2023
Inspired by the success of volumetric 3D pose estimation, some recent human
mesh estimators propose to estimate 3D skeletons as intermediate
representations, from which the dense 3D meshes are regressed by exploiting
the mesh topology. However, body shape information is lost in extracting
skeletons, leading to mediocre performance. Advanced motion capture systems
solve the problem by placing dense physical markers on the body surface, which
allows realistic meshes to be extracted from their non-rigid motions. However,
they cannot be applied to wild images without markers. In this work, we present an
intermediate representation, named virtual markers, which learns 64 landmark
keypoints on the body surface based on the large-scale mocap data in a
generative style, mimicking the effects of physical markers. The virtual
markers can be accurately detected from wild images and reconstruct
intact meshes with realistic shapes by simple interpolation. Our approach
outperforms the state-of-the-art methods on three datasets. In particular, it
surpasses the existing methods by a notable margin on the SURREAL dataset,
which has diverse body shapes. Code is available at
https://github.com/ShirleyMaxx/VirtualMarker
comment: CVPR 2023
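A hedged sketch (not the authors' implementation) of the reconstruction step implied by the abstract: once the 64 virtual markers are detected, the full mesh follows by simple linear interpolation. The weight matrix W is a hypothetical stand-in for weights that would be learned from mocap data; 6890 is the SMPL vertex count, used here only for concreteness.

```python
import numpy as np

num_markers, num_vertices = 64, 6890            # 6890 = SMPL vertex count
W = np.random.rand(num_vertices, num_markers)   # placeholder for learned weights
W /= W.sum(axis=1, keepdims=True)               # rows sum to 1: convex interpolation

markers_3d = np.random.randn(num_markers, 3)    # detected virtual markers (x, y, z)
mesh_vertices = W @ markers_3d                  # (num_vertices, 3) interpolated mesh
```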
♻ ☆ SemanticFormer: Holistic and Semantic Traffic Scene Representation for Trajectory Prediction using Knowledge Graphs
Trajectory prediction in autonomous driving relies on accurate representation
of all relevant contexts of the driving scene, including traffic participants,
road topology, traffic signs, as well as their semantic relations to each
other. Despite increased attention to this issue, most approaches in trajectory
prediction do not consider all of these factors sufficiently. We present
SemanticFormer, an approach for predicting multimodal trajectories by reasoning
over a semantic traffic scene graph using a hybrid approach. It utilizes
high-level information in the form of meta-paths, i.e., trajectories on which
an agent is allowed to drive, derived from a knowledge graph, which is then
processed by a novel pipeline based on multiple attention mechanisms to predict
accurate trajectories. SemanticFormer comprises a hierarchical heterogeneous graph
encoder to capture spatio-temporal and relational information across agents as
well as between agents and road elements. Further, it includes a predictor to
fuse different encodings and decode trajectories with probabilities. Finally, a
refinement module assesses permitted meta-paths of trajectories and speed
profiles to obtain the final predicted trajectories. Evaluation on the nuScenes
benchmark demonstrates improved performance compared to several SOTA methods.
In addition, we demonstrate that our knowledge graph can be easily added to two
existing graph-based SOTA methods, namely VectorNet and Laformer, replacing
their original homogeneous graphs. The evaluation results suggest that by
adding our knowledge graph the performance of the original methods is enhanced
by 5% and 4%, respectively.
comment: 8 pages, 7 figures, has been accepted for publication in the IEEE
Robotics and Automation Letters (RA-L)
♻ ☆ DifAttack++: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross-Domain AAAI24
This work investigates efficient score-based black-box adversarial attacks
with a high Attack Success Rate (\textbf{ASR}) and good generalizability. We
design a novel attack method based on a hierarchical DIsentangled Feature
space, called \textbf{DifAttack++}, which differs significantly from the
existing ones operating over the entire feature space. Specifically,
DifAttack++ first disentangles an image's latent feature into an Adversarial
Feature (\textbf{AF}) and a Visual Feature (\textbf{VF}) via an autoencoder
equipped with our specially designed Hierarchical Decouple-Fusion
(\textbf{HDF}) module, where the AF dominates the adversarial capability of an
image, while the VF largely determines its visual appearance. We train two such
autoencoders for the clean and adversarial image domains (i.e., cross-domain),
respectively, to achieve image reconstruction and feature disentanglement, by
using pairs of clean images and their Adversarial Examples (\textbf{AE}s)
generated from available surrogate models via white-box attack methods.
Eventually, in the black-box attack stage, DifAttack++ iteratively optimizes
the AF according to the query feedback from the victim model until a successful
AE is generated, while keeping the VF unaltered. Extensive experimental results
demonstrate that DifAttack++ achieves superior ASR and query efficiency
compared to state-of-the-art methods, while exhibiting much better visual
quality of AEs. The code is available at https://github.com/csjunjun/DifAttack.git.
comment: arXiv admin note: substantial text overlap with arXiv:2309.14585 An
extension of the AAAI24 paper "DifAttack: Query-Efficient Black-Box Attack
via Disentangled Feature Space."
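A hedged sketch (illustrative, not the released code) of the black-box stage described above: only the disentangled adversarial feature (AF) is perturbed, guided by score feedback from the victim model, while the visual feature (VF) stays fixed. The random-search update rule and the `decoder`/`victim` interfaces are assumptions for illustration.

```python
import torch

def black_box_attack(decoder, victim, af, vf, true_label, sigma=0.05, max_queries=1000):
    """decoder(af, vf) -> image; victim(image) -> class logits (queried black-box)."""
    best_af = af.clone()
    best_score = victim(decoder(best_af, vf))[0, true_label]  # score to minimise
    for _ in range(max_queries):
        candidate = best_af + sigma * torch.randn_like(best_af)
        adv_image = decoder(candidate, vf).clamp(0, 1)
        logits = victim(adv_image)
        if logits.argmax(dim=1).item() != true_label:
            return adv_image                       # successful adversarial example
        score = logits[0, true_label]
        if score < best_score:                     # keep the perturbation that
            best_af, best_score = candidate, score # lowers the true-class score
    return decoder(best_af, vf).clamp(0, 1)
```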
♻ ☆ ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Image-to-video (I2V) generation aims to use the initial frame (alongside a
text prompt) to create a video sequence. A grand challenge in I2V generation is
to maintain visual consistency throughout the video: existing methods often
struggle to preserve the integrity of the subject, background, and style from
the first frame, as well as ensure a fluid and logical progression within the
video narrative. To mitigate these issues, we propose ConsistI2V, a
diffusion-based method to enhance visual consistency for I2V generation.
Specifically, we introduce (1) spatiotemporal attention over the first frame to
maintain spatial and motion consistency, (2) noise initialization from the
low-frequency band of the first frame to enhance layout consistency. These two
approaches enable ConsistI2V to generate highly consistent videos. We also
extend the proposed approaches to show their potential to improve consistency
in auto-regressive long video generation and camera motion control. To verify
the effectiveness of our method, we propose I2V-Bench, a comprehensive
evaluation benchmark for I2V generation. Our automatic and human evaluation
results demonstrate the superiority of ConsistI2V over existing methods.
comment: Project Page: https://tiger-ai-lab.github.io/ConsistI2V/
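A hedged sketch (illustrative assumptions throughout, including the cutoff and FFT mixing scheme) of the noise initialization idea stated in the abstract: keep the low-frequency band of the first frame's latent and fill the high frequencies with fresh Gaussian noise, so every frame starts from layout-consistent initial noise.

```python
import torch

def low_freq_noise_init(first_frame_latent, num_frames, cutoff_ratio=0.25):
    """first_frame_latent: (C, H, W). Returns (num_frames, C, H, W) initial noise."""
    c, h, w = first_frame_latent.shape
    freq = torch.fft.fftshift(torch.fft.fft2(first_frame_latent), dim=(-2, -1))
    # Low-pass mask centred on the zero frequency.
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = ((yy - h // 2).abs() <= cutoff_ratio * h / 2) & \
           ((xx - w // 2).abs() <= cutoff_ratio * w / 2)
    noises = []
    for _ in range(num_frames):
        noise_freq = torch.fft.fftshift(torch.fft.fft2(torch.randn(c, h, w)), dim=(-2, -1))
        mixed = freq * mask + noise_freq * (~mask)   # low band from frame, rest noise
        noises.append(torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real)
    return torch.stack(noises)
```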
♻ ☆ Deep Active Audio Feature Learning in Resource-Constrained Environments
The scarcity of labelled data makes training Deep Neural Network (DNN) models
in bioacoustic applications challenging. In typical bioacoustics applications,
manually labelling the required amount of data can be prohibitively expensive.
To effectively identify both new and current classes, DNN models must continue
to learn new features from a modest amount of fresh data. Active Learning (AL)
is an approach that can help with this learning while requiring little
labelling effort. Nevertheless, the use of fixed feature extraction approaches
limits feature quality, resulting in underutilization of the benefits of AL. We
describe an AL framework that addresses this issue by incorporating feature
extraction into the AL loop and refining the feature extractor after each round
of manual annotation. In addition, we operate directly on raw audio rather than
spectrograms, which is a novel approach. Experiments reveal that the proposed
AL framework requires 14.3%, 66.7%, and 47.4% less labelling effort on the
benchmark audio datasets ESC-50, UrbanSound8k, and InsectWingBeat,
respectively, for a large DNN model, with similar savings for a
microcontroller-based counterpart. Furthermore, we showcase the practical
relevance of our study by incorporating data from conservation biology
projects. All code is publicly available on GitHub.
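A hedged, pseudocode-like sketch of the loop described above: the raw-audio feature extractor is refined after every round of manual annotation instead of being kept fixed. The helpers `model.uncertainty`, `request_human_label`, and `model.fit`, as well as the uncertainty-based query rule, are hypothetical.

```python
def active_learning_loop(model, unlabelled_pool, labelled_set, rounds=10, budget=100):
    for _ in range(rounds):
        # 1. Score unlabelled clips, e.g. by predictive uncertainty (assumption).
        scores = {clip: model.uncertainty(clip) for clip in unlabelled_pool}
        # 2. Ask annotators to label the most informative clips.
        queried = sorted(scores, key=scores.get, reverse=True)[:budget]
        labelled_set.extend((clip, request_human_label(clip)) for clip in queried)
        unlabelled_pool = [c for c in unlabelled_pool if c not in queried]
        # 3. Refine the whole network, including the raw-audio feature extractor,
        #    on all labels collected so far.
        model.fit(labelled_set)
    return model
```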
♻ ☆ Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach
Yansheng Li, Linlin Wang, Tingzhu Wang, Xue Yang, Junwei Luo, Qi Wang, Youming Deng, Wenbin Wang, Xian Sun, Haifeng Li, Bo Dang, Yongjun Zhang, Yi Yu, Junchi Yan
Scene graph generation (SGG) in satellite imagery (SAI) facilitates
intelligent understanding of geospatial scenarios, from perception to cognition.
In SAI, objects exhibit great variations in scales and aspect ratios, and there
exist rich relationships between objects (even between spatially disjoint
objects), which makes it necessary to holistically conduct SGG in large-size
very-high-resolution (VHR) SAI. However, the lack of SGG datasets with
large-size VHR SAI has constrained the advancement of SGG in SAI. Due to the
complexity of large-size VHR SAI, mining triplets in large-size VHR SAI heavily
relies on long-range contextual reasoning. Consequently, SGG models designed
for small-size natural imagery are not directly applicable to large-size VHR
SAI. To address the scarcity of
datasets, this paper constructs a large-scale dataset for SGG in large-size VHR
SAI with image sizes ranging from 512 x 768 to 27,860 x 31,096 pixels, named
RSG, encompassing over 210,000 objects and more than 400,000 triplets. To
realize SGG in large-size VHR SAI, we propose a context-aware cascade cognition
(CAC) framework to understand SAI at three levels: object detection (OBD), pair
pruning and relationship prediction. As a fundamental prerequisite for SGG in
large-size SAI, a holistic multi-class object detection network (HOD-Net) that
can flexibly integrate multi-scale contexts is proposed. Considering that there
exists a huge number of object pairs in large-size SAI but only a minority of
them contain meaningful relationships, we design a pair
proposal generation (PPG) network via adversarial reconstruction to select
high-value pairs. Furthermore, a relationship prediction network with
context-aware messaging (RPCM) is proposed to predict the relationship types of
these pairs.
comment: This paper releases a SAI-oriented SGG toolkit with about 30 OBD
methods and 10 SGG methods, and develops a benchmark based on RSG where our
HOD-Net and RPCM significantly outperform the state-of-the-art methods in
both OBD and SGG tasks. The RSG dataset and SAI-oriented toolkit will be made
publicly available at https://linlin-dev.github.io/project/RSG
♻ ☆ Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu
Video sequences offer valuable temporal information, but existing large
multimodal models (LMMs) fall short in understanding extremely long videos.
Many works address this by reducing the number of visual tokens using visual
resamplers. Alternatively, in this paper, we approach this problem from the
perspective of the language model. By simply extrapolating the context length
of the language backbone, we enable LMMs to comprehend orders of magnitude more
visual tokens without any video training. We call this phenomenon long context
transfer and carefully ablate its properties. To effectively measure LMMs'
ability to generalize to long contexts in the vision modality, we develop
V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark
inspired by the language model's NIAH test. Our proposed Long Video Assistant
(LongVA) can process 2000 frames or over 200K visual tokens without additional
complexities. With its extended context length, LongVA achieves
state-of-the-art performance on Video-MME among 7B-scale models by densely
sampling more input frames. Our work is open-sourced at
https://github.com/EvolvingLMMs-Lab/LongVA.
comment: Code, demo, and models are available at
https://github.com/EvolvingLMMs-Lab/LongVA
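The abstract only states that the language backbone's context length is extrapolated; one common way to extend context (an assumption for illustration here, not a claim about LongVA's exact recipe) is to enlarge the rotary position embedding (RoPE) base so that positions far beyond the training length remain well-behaved.

```python
import torch

def rope_frequencies(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Per-pair rotation frequencies used by rotary position embeddings.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Original backbone vs. a context-extended variant with a larger RoPE base.
orig_freqs = rope_frequencies(dim=128, base=10_000.0)
extended_freqs = rope_frequencies(dim=128, base=1_000_000.0)  # slower rotation -> longer range
```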
♻ ☆ Video Anomaly Detection in 10 Years: A Survey and Outlook
Video anomaly detection (VAD) holds immense importance across diverse domains
such as surveillance, healthcare, and environmental monitoring. While numerous
surveys focus on conventional VAD methods, they often lack depth in exploring
specific approaches and emerging trends. This survey explores deep
learning-based VAD, expanding beyond traditional supervised training paradigms
to encompass emerging weakly supervised, self-supervised, and unsupervised
approaches. A prominent feature of this review is the investigation of core
challenges within the VAD paradigms, including large-scale datasets, feature
extraction, learning methods, loss functions, regularization, and anomaly score
prediction. Moreover, this review also investigates the vision language models
(VLMs) as potent feature extractors for VAD. VLMs integrate visual data with
textual descriptions or spoken language from videos, enabling a nuanced
understanding of scenes crucial for anomaly detection. By addressing these
challenges and proposing future research directions, this review aims to foster
the development of robust and efficient VAD systems leveraging the capabilities
of VLMs for enhanced anomaly detection in complex real-world scenarios. This
comprehensive analysis seeks to bridge existing knowledge gaps, provide
researchers with valuable insights, and contribute to shaping the future of VAD
research.
♻ ☆ Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images CVPR 2024
A long-standing challenge in developing machine learning approaches has been
the lack of high-quality labeled data. Recently, models trained with purely
synthetic data, here termed synthetic clones, generated using large-scale
pre-trained diffusion models have shown promising results in overcoming this
annotation bottleneck. As these synthetic clone models progress, they are
likely to be deployed in challenging real-world settings, yet their suitability
remains understudied. Our work addresses this gap by providing the first
benchmark for three classes of synthetic clone models, namely supervised,
self-supervised, and multi-modal ones, across a range of robustness measures.
We show that existing synthetic self-supervised and multi-modal clones are
comparable to or outperform state-of-the-art real-image baselines on a range
of robustness metrics (shape bias, background bias, calibration, etc.). However,
we also find that synthetic clones are much more susceptible to adversarial and
real-world noise than models trained with real data. To address this, we find
that combining both real and synthetic data further increases the robustness,
and that the choice of prompt used for generating synthetic images plays an
important part in the robustness of synthetic clones.
comment: Accepted at CVPR 2024 Workshop: SyntaGen-Harnessing Generative Models
for Synthetic Visual Datasets. Project page at
https://synbenchmark.github.io/SynCloneBenchmark Comments: Fix typo in Fig. 1
♻ ☆ SketchQL Demonstration: Zero-shot Video Moment Querying with Sketches
Renzhi Wu, Pramod Chunduri, Dristi J Shah, Ashmitha Julius Aravind, Ali Payani, Xu Chu, Joy Arulraj, Kexin Rong
In this paper, we present SketchQL, a video database management system
(VDBMS) for retrieving video moments with a sketch-based query interface. This
novel interface allows users to specify object trajectory events with simple
mouse drag-and-drop operations. Users can use trajectories of single objects as
building blocks to compose complex events. Using a pre-trained model that
encodes trajectory similarity, SketchQL achieves zero-shot video moments
retrieval by performing similarity searches over the video to identify clips
that are the most similar to the visual query. In this demonstration, we
introduce the graphical user interface of SketchQL and detail its functionalities
and interaction mechanisms. We also demonstrate the end-to-end usage of
SketchQL from query composition to video moments retrieval using real-world
scenarios.
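A hedged sketch (illustrative, with a hypothetical `encoder`) of the zero-shot retrieval step described above: the sketched trajectory and every candidate clip's object trajectory are embedded by the same pretrained similarity model, and the clips most similar to the query are returned.

```python
import torch
import torch.nn.functional as F

def retrieve_moments(encoder, sketch_traj, clip_trajs, top_k=5):
    """encoder: pretrained trajectory encoder (assumed), traj -> (D,) embedding.
    sketch_traj: user-drawn trajectory; clip_trajs: list of per-clip trajectories."""
    query = F.normalize(encoder(sketch_traj), dim=-1)
    clip_embs = F.normalize(torch.stack([encoder(t) for t in clip_trajs]), dim=-1)
    scores = clip_embs @ query                    # cosine similarity per clip
    return torch.topk(scores, k=min(top_k, len(clip_trajs))).indices
```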
♻ ☆ Geometry-Aware Score Distillation via 3D Consistent Noising and Gradient Consistency Modeling
Score distillation sampling (SDS), the methodology in which the score from
pretrained 2D diffusion models is distilled into 3D representation, has
recently brought significant advancements in the text-to-3D generation task.
However, this approach is still confronted with critical geometric
inconsistency problems such as the Janus problem. Starting from a hypothesis
that such inconsistency problems may be induced by multiview inconsistencies
between 2D scores predicted from various viewpoints, we introduce GSD, a simple
and general plug-and-play framework for incorporating 3D consistency and
therefore geometry awareness into the SDS process. Our methodology is composed
of three components: 3D consistent noising, designed to produce 3D consistent
noise maps that perfectly follow the standard Gaussian distribution,
geometry-based gradient warping for identifying correspondences between
predicted gradients of different viewpoints, and a novel gradient consistency
loss to optimize the scene geometry toward producing more consistent gradients.
We demonstrate that our method significantly improves performance, successfully
addressing the geometric inconsistency problems in the text-to-3D generation
task with minimal computational cost while remaining compatible with existing
score distillation-based models. Our project page is available at
https://ku-cvlab.github.io/GSD/.
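A hedged sketch (simplified, with a hypothetical geometry-based `warp_b_to_a`) of the gradient consistency idea stated in the abstract: SDS gradients predicted from two viewpoints are brought into correspondence by warping, and their disagreement on mutually visible pixels is penalised.

```python
import torch

def gradient_consistency_loss(grad_view_a, grad_view_b, warp_b_to_a, valid_mask):
    """grad_view_*: (C, H, W) SDS gradients rendered from two views.
    warp_b_to_a: function mapping view-b pixels onto view a (assumed given by geometry).
    valid_mask: (H, W) bool mask of pixels visible in both views."""
    warped_b = warp_b_to_a(grad_view_b)
    diff = (grad_view_a - warped_b) ** 2
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```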